Entity Resolution with Heavy Indexing

نویسنده

  • Csaba István Sidló
چکیده

Entity resolution (ER), or deduplication is a computationally hard problem with O(n) time complexity. We reformulate ER as a search problem, and develop algorithms using efficient indices. Indices can enhance algorithm scalability, facilitate distributed processing, but require additional storage space. We study the performance and tradeoffs between index update and search in ER algorithms, and show that significant performance gain can be obtained by using indices. We also demonstrate the strength of our algorithms in the real-world scenario of an insurance customer master data creation procedure.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution

Entity resolution is the process of identifying groups of records in a single or multiple data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on stati...

متن کامل

Strategies for Large-Scale Entity Resolution Based on Inverted Index Data Partitioning

Inverted indexing is a commonly used technique for improving the performance of entity resolution algorithms by reducing the number of pair-wise comparisons necessary to arrive at acceptable results. This chapter describes how inverted indexing can also be used as a data partitioning strategy to perform entity resolution on large datasets in a distributed processing environment. This chapter di...

متن کامل

A Dynamic Indexing for Incremental Entity Resolution over Query Results

Entity Resolution (ER) is the problem of identifying groups of tuples from one or multiple data sources that represent the same real-world entity. This is a crucial stage of data integration processes, which often need to integrate data at query time. This task becomes more challenging in scenarios with dynamic data sources or with a large volume of data. As most ER techniques deal with all tup...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011